import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix
plt.close('all')
from IPython.display import display, HTML
HTML('''
<script
src="https://cdnjs.cloudflare.com/ajax/libs/jquery/2.0.3/jquery.min.js ">
</script>
<script>
code_show=true;
function code_toggle() {
if (code_show){
$('div.jp-CodeCell > div.jp-Cell-inputWrapper').hide();
} else {
$('div.jp-CodeCell > div.jp-Cell-inputWrapper').show();
}
code_show = !code_show
}
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit"
value="Click here to toggle on/off the raw code."></form>
''')
%run 'Lab_preprocess.ipynb'
%run 'Lab_RepresentativeClustering.ipynb'
%run 'Lab_Hierarchal.ipynb'
%run 'Lab_DensityClustering.ipynb'
I. Executive Summary
Our analysis delves into the World Development Indicators, a comprehensive dataset with diverse global data, focusing specifically on educational aspects. The primary objectives were to unearth patterns in global educational data, understand the diversity of educational systems, and relate foundational education to overall educational outcomes. The desired outcome was to categorize countries based on shared educational traits to aid in benchmarking and inform policy decisions.
Methodologically, we utilized K-Medoids for its robustness in handling data anomalies and Ward's method in hierarchical clustering to minimize the sum of squared differences within clusters. We also applied Density-Based Spatial Clustering of Applications with Noise (DBSCAN) for its efficacy in grouping data points by density.
Our representative clustering identified Morocco, Hungary, and Guinea as the cluster medoids, reflecting a spectrum of global educational realities. Morocco and Hungary, while showing high enrollment rates, differ in student retention beyond primary education, especially for girls. Guinea represents more fundamental systemic challenges, evident in its low secondary enrollment and high repetition rates. These insights guide targeted recommendations: dropout prevention strategies for Morocco's cluster, equitable support across socio-economic backgrounds in Hungary's cluster, and significant systemic investments for Guinea's cluster.
A purity test combining hierarchical and DBSCAN methods gave us a robust score of 81.37%, affirming the effectiveness of both methods in clustering similar data points. Hierarchical clustering, especially with Ward Linkage, adeptly complemented DBSCAN, identifying its outliers as a distinct cluster. This synergy underscores the utility of employing both methods for an in-depth analysis.
In conclusion, our clusters broadly delineate more and less developed education systems, but it's vital to recognize the unique educational strengths and challenges within each country in a cluster. This nuanced understanding is essential for crafting effective educational strategies, especially in contexts similar to Guinea's cluster, as observed in the Philippines, highlighting the need for infrastructural and curricular reforms.
II. Introduction
| Column Name | Description |
|---|---|
| Country Code | Unique code for each country |
| Year | Year of data collection |
| Various Indicators | Different educational indicators like duration of compulsory education, intake ratios, school ages, enrollment percentages, etc. |
| Indicator Name | Descriptive name of the educational indicator |
| Indicator Code | Unique code for each educational indicator. These will serve as our features for our clustering |
The dataset includes indicators such as how long children must stay in school, how many boys and girls are enrolled, the ages at which they start, and whether students are older than usual for their grade or repeating a year.
III. Methodology
1. Data Preprocessing
DATASET LIMITATIONS AND ASSUMPTIONS
- The dataset contains many null values, so we had to reduce the number of countries to 102.
- All features are directly related to education according to WDI's own classification.
- All features are either ratios or percentages of GDP or total population, since we wanted to avoid absolute values as much as possible so that countries can be compared effectively.
1. WDI.db contains all indicators related to education, based on WDI's descriptions and the type assigned to each indicator.
df_ed.tail()
| | Indicator Name |
|---|---|
| 139 | trained teachers in secondary education, femal... |
| 140 | trained teachers in secondary education, male ... |
| 141 | trained teachers in upper secondary education ... |
| 142 | trained teachers in upper secondary education,... |
| 143 | trained teachers in upper secondary education,... |
Here we limited the Indicator Name to just those related to education.
df_main.head()
| | Country Name | Country Code | Indicator Name | Indicator Code | 1960 | 1961 | 1962 | 1963 | 1964 | 1965 | ... | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 | 2021 | Unnamed: 66 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | africa eastern and southern | AFE | access to clean fuels and technologies for coo... | EG.CFT.ACCS.ZS | NaN | NaN | NaN | NaN | NaN | NaN | ... | 16.936004 | 17.337896 | 17.687093 | 18.140971 | 18.491344 | 18.825520 | 19.272212 | 19.628009 | NaN | NaN |
| 1 | africa eastern and southern | AFE | access to clean fuels and technologies for coo... | EG.CFT.ACCS.RU.ZS | NaN | NaN | NaN | NaN | NaN | NaN | ... | 6.499471 | 6.680066 | 6.859110 | 7.016238 | 7.180364 | 7.322294 | 7.517191 | 7.651598 | NaN | NaN |
| 2 | africa eastern and southern | AFE | access to clean fuels and technologies for coo... | EG.CFT.ACCS.UR.ZS | NaN | NaN | NaN | NaN | NaN | NaN | ... | 37.855399 | 38.046781 | 38.326255 | 38.468426 | 38.670044 | 38.722783 | 38.927016 | 39.042839 | NaN | NaN |
| 3 | africa eastern and southern | AFE | access to electricity (% of population) | EG.ELC.ACCS.ZS | NaN | NaN | NaN | NaN | NaN | NaN | ... | 31.794160 | 32.001027 | 33.871910 | 38.880173 | 40.261358 | 43.061877 | 44.270860 | 45.803485 | NaN | NaN |
| 4 | africa eastern and southern | AFE | access to electricity, rural (% of rural popul... | EG.ELC.ACCS.RU.ZS | NaN | NaN | NaN | NaN | NaN | NaN | ... | 18.663502 | 17.633986 | 16.464681 | 24.531436 | 25.345111 | 27.449908 | 29.641760 | 30.404935 | NaN | NaN |
5 rows × 67 columns
You will also notice that Country Name contains aggregates rather than individual countries: regions such as africa eastern and southern, africa western and central, and arab world, as well as groupings such as low-income countries.
2. Merge indicators and indicator description with main dataset.
df_test.head()
| | Indicator Name | Country Name | Country Code | Indicator Code | 1960 | 1961 | 1962 | 1963 | 1964 | 1965 | ... | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 | 2021 | Unnamed: 66 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | adjusted net enrollment rate, primary (% of pr... | africa eastern and southern | AFE | SE.PRM.TENR | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | adjusted net enrollment rate, primary (% of pr... | africa western and central | AFW | SE.PRM.TENR | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | adjusted net enrollment rate, primary (% of pr... | arab world | ARB | SE.PRM.TENR | NaN | NaN | NaN | NaN | NaN | NaN | ... | 84.21832 | 84.25430 | 84.03523 | 84.53258 | 85.14375 | 85.38422 | NaN | NaN | NaN | NaN |
| 3 | adjusted net enrollment rate, primary (% of pr... | caribbean small states | CSS | SE.PRM.TENR | NaN | NaN | NaN | NaN | NaN | NaN | ... | 89.77977 | 89.57198 | 90.92441 | 90.48512 | 89.39624 | 88.92917 | NaN | NaN | NaN | NaN |
| 4 | adjusted net enrollment rate, primary (% of pr... | central europe and the baltics | CEB | SE.PRM.TENR | NaN | NaN | NaN | NaN | NaN | NaN | ... | 94.01037 | 93.41415 | 93.45411 | 93.06906 | 92.81936 | 91.02484 | NaN | NaN | NaN | NaN |
5 rows × 67 columns
3. Put Indicator Codes in their respective columns since they will be our features.
4. We chose 2011 because it's the most complete (year with the least Null values).
5. Limit the Country Code to just countries since the previous table includes non-countries or regions representing multiple countries.
Since the dataset initially had Years for columns while the Indicator Name was listed in rows, we interchange them and drop unnecessary columns.
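The reshaping described above can be sketched with pandas. This is only an illustration on a toy table, not the notebook's actual preprocessing code: the frame `df_wide`, the helper `indicators_as_columns`, and the sample values are assumptions made for the example.

```python
import pandas as pd

# Toy stand-in for the merged table (hypothetical values):
# one row per (country, indicator) and one column per year.
df_wide = pd.DataFrame({
    "Country Code": ["ABW", "ABW", "AFG", "AFG"],
    "Indicator Code": ["SE.COM.DURS", "SE.PRM.AGES", "SE.COM.DURS", "SE.PRM.AGES"],
    "2010": [12.0, 6.0, 8.0, 7.0],
    "2011": [13.0, 6.0, 9.0, 7.0],
})


def indicators_as_columns(df, year):
    """Keep a single year and turn each Indicator Code into its own column."""
    out = df.pivot(index="Country Code", columns="Indicator Code", values=year)
    out.columns.name = None          # drop the leftover "Indicator Code" axis label
    out = out.reset_index()
    out.insert(1, "Year", int(year))
    return out


df_year = indicators_as_columns(df_wide, "2011")
print(df_year)
```

After the pivot, each row is one country and each indicator code is a feature column, which is the layout the clustering steps expect.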
df_2011.head()
| | Country Code | Year | SE.ADT.1524.LT.FE.ZS | SE.ADT.1524.LT.FM.ZS | SE.ADT.1524.LT.MA.ZS | SE.ADT.1524.LT.ZS | SE.ADT.LITR.FE.ZS | SE.ADT.LITR.MA.ZS | SE.ADT.LITR.ZS | SE.COM.DURS | ... | SE.XPD.CTER.ZS | SE.XPD.CTOT.ZS | SE.XPD.PRIM.PC.ZS | SE.XPD.PRIM.ZS | SE.XPD.SECO.PC.ZS | SE.XPD.SECO.ZS | SE.XPD.TERT.PC.ZS | SE.XPD.TERT.ZS | SE.XPD.TOTL.GB.ZS | SE.XPD.TOTL.GD.ZS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 41 | ABW | 2011 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 13.0 | ... | 100.000000 | 100.000000 | NaN | NaN | NaN | NaN | NaN | NaN | 21.750540 | 6.11913 |
| 143 | AFG | 2011 | 32.113220 | 0.51897 | 61.879070 | 46.990051 | 17.017839 | 45.417099 | 31.448851 | 9.0 | ... | 77.549629 | 82.625092 | 12.21159 | 61.97491 | 12.56712 | 26.62432 | 96.09478 | 8.98621 | 16.048429 | 3.46201 |
| 245 | AGO | 2011 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 6.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 8.940000 | 3.03000 |
| 296 | ALB | 2011 | 98.856239 | 1.00126 | 98.731361 | 98.791191 | 95.691483 | 98.008163 | 96.845299 | 8.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 12.840000 | 3.08000 |
| 347 | AND | 2011 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 10.0 | ... | 94.550323 | 98.824097 | 17.79961 | 28.80875 | 13.52729 | 21.35985 | NaN | 3.84056 | NaN | 2.98706 |
5 rows × 139 columns
6. Remove the nulls by setting thresholds that leave few enough gaps for us to be comfortable imputing. In this case, it is acceptable to drop countries with null values in the features we selected. Since we wanted to retain as many features as we can, we opted for high thresholds in features.
print("After cleaning, we are left with: ", df_country.shape)
After cleaning, we are left with: (102, 27)
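A minimal sketch of threshold-based null filtering is shown below. The function name `drop_sparse` and both threshold values are hypothetical; the report does not state the thresholds that were actually used.

```python
import numpy as np
import pandas as pd


def drop_sparse(df, max_null_feature=0.5, max_null_country=0.2):
    """Drop columns, then rows, whose share of nulls exceeds the given thresholds.

    Both thresholds are illustrative assumptions, not the values used in the report.
    """
    keep_cols = df.columns[df.isnull().mean(axis=0) <= max_null_feature]
    df = df[keep_cols]
    keep_rows = df.isnull().mean(axis=1) <= max_null_country
    return df[keep_rows]


toy = pd.DataFrame(
    {
        "f1": [1.0, 2.0, 3.0, 4.0],
        "f2": [np.nan, np.nan, np.nan, 1.0],  # 75% null, so the feature is dropped
        "f3": [1.0, np.nan, 3.0, 4.0],        # 25% null, so the feature is kept
    },
    index=["A", "B", "C", "D"],
)
cleaned = drop_sparse(toy)
print("After cleaning, we are left with: ", cleaned.shape)
```

Filtering features first and countries second means a country is judged only on the features that survive, which tends to keep more rows.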
7. We performed imputation using IterativeImputer with a LinearRegression() estimator.
print(f'Current Number of Nulls: {X.isnull().sum().sum()}')
X.tail()
Current Number of Nulls: 0
| | SE.COM.DURS | SE.ENR.PRIM.FM.ZS | SE.ENR.SECO.FM.ZS | SE.PRM.AGES | SE.PRM.DURS | SE.PRM.ENRL.FE.ZS | SE.PRM.ENRR | SE.PRM.ENRR.FE | SE.PRM.ENRR.MA | SE.PRM.GINT.FE.ZS | ... | SE.PRM.OENR.ZS | SE.PRM.PRIV.ZS | SE.PRM.REPT.ZS | SE.SEC.AGES | SE.SEC.DURS | SE.SEC.ENRL.FE.ZS | SE.SEC.ENRL.GC.FE.ZS | SE.SEC.ENRR | SE.SEC.ENRR.FE | SE.SEC.ENRR.MA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 97 | 8.0 | 0.99054 | 0.917890 | 6.0 | 5.0 | 48.77150 | 101.049561 | 100.559967 | 101.520119 | 100.08062 | ... | 5.76218 | 11.381853 | 2.35584 | 11.0 | 7.0 | 47.037630 | 48.466150 | 88.338303 | 84.496590 | 92.055481 |
| 98 | 11.0 | 1.01228 | 0.970870 | 6.0 | 4.0 | 48.87233 | 100.632607 | 101.264503 | 100.035919 | 105.36214 | ... | 7.94891 | 0.539720 | 0.06396 | 10.0 | 7.0 | 48.045860 | 49.270140 | 93.503242 | 92.088188 | 94.851112 |
| 99 | 14.0 | 0.97039 | 1.542867 | 6.0 | 6.0 | 48.20959 | 113.254959 | 111.518219 | 114.920952 | 100.75008 | ... | 9.80920 | 16.491699 | 5.38069 | 12.0 | 6.0 | 61.623135 | 60.108850 | 105.427803 | 127.998695 | 83.583115 |
| 100 | 12.0 | 0.97978 | 0.992780 | 7.0 | 4.0 | 48.34564 | 94.538162 | 93.550720 | 95.481422 | 94.23476 | ... | 1.02560 | 3.002763 | 0.00431 | 11.0 | 8.0 | 48.602320 | 50.412313 | 90.215843 | 89.881271 | 90.534523 |
| 101 | 14.0 | 0.97535 | 1.093720 | 6.0 | 6.0 | 48.26757 | 102.418861 | 101.112923 | 103.668114 | 95.77170 | ... | 9.40757 | 17.549299 | 3.52591 | 12.0 | 5.0 | 51.212060 | 51.295100 | 83.746429 | 87.575607 | 80.071411 |
5 rows × 25 columns
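The imputation step can be sketched as follows with scikit-learn's `IterativeImputer` and a `LinearRegression` estimator. The toy matrix and the `max_iter` and `random_state` settings are assumptions for illustration; the report does not list the settings it used.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (required to expose IterativeImputer)
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LinearRegression

# Toy data (hypothetical): three linearly related columns with two values removed.
rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
X_toy = np.column_stack([x1, 2 * x1 + 1, -x1 + 3])
X_missing = X_toy.copy()
X_missing[0, 1] = np.nan
X_missing[5, 2] = np.nan

# Each feature with missing values is regressed on the other features,
# and the predictions fill the gaps; the cycle repeats up to max_iter times.
imputer = IterativeImputer(estimator=LinearRegression(), max_iter=10, random_state=0)
X_filled = imputer.fit_transform(X_missing)

print(f"Current Number of Nulls: {np.isnan(X_filled).sum()}")
```

Only the missing cells are changed; observed values pass through untouched.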
8. We conduct PCA on the 25 education features.
We set n_components to 7, which already explains 99% of the variance.
country_pca.head()
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|---|
| 0 | 13.768181 | -15.374378 | -10.625675 | 2.439129 | -5.916001 | 0.612051 | 3.656233 |
| 1 | 30.939147 | 23.383609 | 11.415078 | -11.658356 | -3.554324 | -1.906466 | -0.194884 |
| 2 | 19.428307 | -2.824350 | -14.371960 | 7.583378 | 2.049612 | -1.291544 | 0.425541 |
| 3 | 25.801059 | -7.061240 | -9.232107 | -4.382353 | -12.541582 | 6.097702 | 3.665695 |
| 4 | 1.692947 | -15.695533 | -15.577955 | 2.456936 | -2.833778 | 0.855853 | 1.027908 |
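One way to arrive at a component count like this is to ask scikit-learn's `PCA` for the smallest number of components that reaches a target share of variance. The sketch below uses synthetic data with three underlying factors, so it selects far fewer than 7 components; the data and the 0.99 target are assumptions for the example.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in (hypothetical): 102 rows and 25 features driven by 3 latent factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(102, 3)) * np.array([10.0, 5.0, 2.0])
mixing = rng.normal(size=(3, 25))
X_toy = latent @ mixing + 0.01 * rng.normal(size=(102, 25))

# A float n_components in (0, 1) keeps the fewest components whose
# cumulative explained variance reaches that fraction.
pca = PCA(n_components=0.99, svd_solver="full")
scores = pca.fit_transform(X_toy)

cumulative = np.cumsum(pca.explained_variance_ratio_)
print("components kept:", pca.n_components_)
print("cumulative explained variance:", cumulative[-1])
```

Inspecting `explained_variance_ratio_` on the real feature matrix is how a fixed choice such as `n_components=7` can be checked against the 99% figure.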
2. Representative Clustering: K-Medoids
K-Medoids is a clustering algorithm that, unlike K-Means, chooses actual data points as cluster centers, which are known as medoids. This method is particularly robust as it selects the most centrally located point in a cluster, ensuring that the centers are representative of the actual data distribution and not skewed by outliers.
The algorithm is as follows:
Algorithm GenericMedoids(Database: $D$, Number of representatives: $k$)
begin
Initialize representative set $S$ by selecting from $D$;
repeat
$\quad$Create clusters ($C_1, ..., C_k$) by assigning each point in $D$ to closest representative in $S$ using the distance function $Dist(\cdot,\cdot)$
$\quad$Determine a pair $x_i \in D$ and $y_j \in S$ such that replacing $y_j$ with $x_i$ leads to the greatest possible improvement in objective function
$\quad$Perform the exchange between $x_i$ and $y_j$ only if improvement is positive
until no improvement in current iteration;
return $(C_1, ..., C_k)$;
end
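The generic procedure above can be sketched in NumPy as a greedy swap search. This is an illustrative implementation, not the one used in the notebook: it assumes Euclidean distance, uses the sum of distances to the nearest medoid as the objective, and the function name `k_medoids` is hypothetical.

```python
import numpy as np


def k_medoids(X, k, seed=0, max_iter=100):
    """Greedy swap-based K-Medoids (illustrative sketch, Euclidean distance)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    medoids = rng.choice(n, size=k, replace=False)  # initialize S from D

    def cost(candidate):
        # Objective: total distance from every point to its closest medoid.
        return dist[:, candidate].min(axis=1).sum()

    best = cost(medoids)
    for _ in range(max_iter):
        best_swap, best_cost = None, best
        # Find the (medoid, non-medoid) exchange with the greatest improvement.
        for j in range(k):
            for i in range(n):
                if i in medoids:
                    continue
                trial = medoids.copy()
                trial[j] = i
                c = cost(trial)
                if c < best_cost:
                    best_swap, best_cost = (j, i), c
        if best_swap is None:  # no positive improvement: stop
            break
        medoids[best_swap[0]] = best_swap[1]
        best = best_cost

    labels = dist[:, medoids].argmin(axis=1)
    return medoids, labels


# Toy data (hypothetical): three well-separated blobs of 10 points each.
rng = np.random.default_rng(1)
blob_centers = np.array([[0.0, 0.0], [10.0, 10.0], [20.0, 0.0]])
X_toy = np.vstack([c + rng.normal(scale=0.5, size=(10, 2)) for c in blob_centers])
medoids, labels = k_medoids(X_toy, k=3)
print(medoids, labels)
```

Because every medoid is a row index of the data, each cluster center can be mapped straight back to an observed country.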
Opting for K-Medoids over K-Means is advisable when the dataset contains anomalies or outliers, as K-Medoids is less sensitive to such variations. In the context of our dataset, which likely includes noise and outliers, K-Medoids provides a more reliable clustering by anchoring the clusters to genuine, observed data points, making the resulting groupings more interpretable and applicable for real-world scenarios. To choose the optimal number of clusters, we employ the following measures:
- Sum of squared distances to centroid (SSD):
Smaller values suggest better clustering. $$ \text{SSD} = \sum_{i=1}^{k} \sum_{x \in C_i} ||x - \mu_i||^2 $$
- Calinski-Harabasz index (CH):
The higher the value of this measure, the more defined the clusters are. $$s_k = \frac{B_k/(k-1)}{W_k/(n-k)}$$
- Silhouette coefficient (SC):
A value between 0.5 and less than 1 means a more defined cluster. $$ S_i = \frac{Dmin^{out}_i - Davg^{in}_i}{\max\{Dmin^{out}_i, Davg^{in}_i\}}$$
- Davies-Bouldin Index (DB):
Small values of $DB$ imply compact and separated clusters. $$DB = \frac{1}{k} \sum_i R_i = \frac{1}{k} \sum_i \max_{j \neq i}{R_{ij}}$$
- Gap Statistics (GS):
The $k$ before the sudden rate of change. $$\text{Gap}_n(k) = \frac{1}{b} \sum_{i=1}^{b} \log(\bar{W}_{k,i}) - \log(\bar{W}_k)$$
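Four of these measures can be computed for a range of $k$ as sketched below. Note the substitution: scikit-learn's core library has no K-Medoids estimator, so this sketch uses `KMeans` as a stand-in clusterer on toy data; the scoring functions are the same regardless of which algorithm produced the labels. The gap statistic is omitted because it needs a reference-distribution simulation that is beyond a short example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)

# Toy data (hypothetical): three compact, well-separated blobs.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [5.0, 10.0]])
X_toy = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

scores = {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_toy)
    scores[k] = {
        "SSD": km.inertia_,                               # lower is better
        "CH": calinski_harabasz_score(X_toy, km.labels_),  # higher is better
        "SC": silhouette_score(X_toy, km.labels_),         # higher is better
        "DB": davies_bouldin_score(X_toy, km.labels_),     # lower is better
    }

best_sc = max(scores, key=lambda k: scores[k]["SC"])
for k, s in scores.items():
    print(k, {name: round(value, 3) for name, value in s.items()})
print("k with the highest silhouette:", best_sc)
```

Reading the measures side by side, rather than trusting a single one, is what motivates comparing SC, DB, and the gap statistic when selecting $k$.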
3. Hierarchical Clustering: Ward's Method
Ward's Method defines the distance between two clusters, $A$ and $B$, as the amount the sum of squares will increase when we merge them:
$$ \Delta(A, B) = \sum_{i \in A \cup B} \|x_i - m_{A \cup B}\|^2 - \sum_{i \in A} \|x_i - m_A\|^2 - \sum_{i \in B} \|x_i - m_B\|^2 = \frac{n_A n_B}{n_A + n_B} \|m_A - m_B\|^2 $$
where:
- $m_j$ is the center of cluster $j$
- $n_j$ is the number of points in it
- $\Delta$ is the merging cost of combining the clusters $A$ and $B$.
Starting from individual points as a cluster, the method merges them while trying to minimize the growth of $\Delta$. Given two pairs of clusters whose centers are equally far apart, Ward’s method will prefer to merge the smaller ones.
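A small numerical check can make the merging cost concrete. The sketch below computes $\Delta(A, B)$ directly as the increase in the sum of squares and compares it with the equivalent closed form $\frac{n_A n_B}{n_A + n_B} \|m_A - m_B\|^2$. It also shows why, for equal center distances, smaller clusters are cheaper to merge. The point sets and function names are made up for illustration.

```python
import numpy as np


def sse(points):
    """Sum of squared distances of the points to their own mean."""
    return ((points - points.mean(axis=0)) ** 2).sum()


def ward_cost_direct(A, B):
    """Increase in total sum of squares caused by merging A and B."""
    return sse(np.vstack([A, B])) - sse(A) - sse(B)


def ward_cost_closed_form(A, B):
    """Equivalent form: n_A * n_B / (n_A + n_B) * ||m_A - m_B||^2."""
    n_a, n_b = len(A), len(B)
    diff = A.mean(axis=0) - B.mean(axis=0)
    return n_a * n_b / (n_a + n_b) * (diff @ diff)


# Two small clusters whose centers are 5 units apart.
A_small = np.array([[-1.0, 0.0], [1.0, 0.0]])
B_small = np.array([[-1.0, 5.0], [1.0, 5.0]])

# Two larger clusters with exactly the same centers (each point repeated 10 times).
A_large = np.tile(A_small, (10, 1))
B_large = np.tile(B_small, (10, 1))

cost_small = ward_cost_direct(A_small, B_small)
cost_large = ward_cost_direct(A_large, B_large)
print(cost_small, ward_cost_closed_form(A_small, B_small))
print(cost_large, ward_cost_closed_form(A_large, B_large))
```

Both pairs have centers the same distance apart, yet the larger pair costs ten times more to merge, so Ward's method merges the smaller pair first.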
Using Ward's method for our dataset is a strategic choice because it helps us form clusters by minimizing the sum of squared differences within all clusters. This means that countries grouped together have similar educational metrics, which is exactly what we're looking for. By applying Ward's method, we can confidently say that countries within each cluster have comparable education indicators, making it easier for us to draw insights and make recommendations. It's especially useful for spotting distinct patterns or outliers in our data, guiding us in providing tailored suggestions for educational improvements.
4. Density-Based Clustering: Density-based Spatial Clustering of Applications with Noise (DBSCAN)
DBSCAN is a clustering algorithm that groups points based on density. It labels each point in a dataset as a core point, border point, or noise.
- core point: a point with at least $MinPts$ points in its $\epsilon$-neighborhood
- border point: a point with fewer than $MinPts$ neighbors that has a core point in its neighborhood
- noise point: a point with fewer than $MinPts$ neighbors and no core point in its neighborhood
The choice of 𝜖 and MinPts is crucial. For example, in spatial datasets, if the maximum interaction distance is 100 meters, setting 𝜖 to 100 meters and MinPts to a value like 3 would group points into clusters only if there are at least three within this range. This method effectively separates denser regions (clusters) from less dense areas (noise).
Input:
- $D$: a dataset containing $n$ objects
- $\epsilon$ : the radius parameter
- $MinPts$: the neighborhood density threshold
Output: A set of density-based clusters
Method:
mark all objects as unvisited;
do
randomly select an unvisited object $p$;
mark $p$ as visited;
if the $\epsilon$-neighborhood of $p$ has at least $MinPts$ objects
create new cluster $C$, and add $p$ to $C$;
let $N$ be the set of objects in the $\epsilon$-neighborhood of $p$;
for each point $p'$ in $N$
if $p'$ is unvisited
mark $p'$ as visited;
if the $\epsilon$-neighborhood of $p'$ has at least $MinPts$ points,
add those points to $N$;
if $p'$ is not yet a member of any cluster, add $p'$ to $C$;
end for
output C;
else mark $p$ as noise;
until no object is unvisited;
end
DBSCAN is a practical choice for our dataset because it automatically detects the number of clusters and is excellent at dealing with outliers, labeling sparse points as noise. This method's ability to handle arbitrary cluster shapes and sizes makes it versatile, especially since we don't need to specify how many clusters we expect in advance. Its robustness to anomalies ensures that the clusters formed are genuinely representative of significant trends in our data, making it a suitable tool for analyzing educational datasets that may contain irregularities or unusual patterns. For this report, the hyperparameters were chosen by experimenting with different values: eps is set to 38 and min_samples to 11.
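The pseudocode above can be sketched in NumPy as follows. This is an illustrative version, not the implementation used in the notebook: it scans points in index order instead of picking them at random, counts a point as part of its own neighborhood, and uses toy values for `eps` and `min_pts` rather than the report's 38 and 11.

```python
import numpy as np

NOISE = -1


def dbscan(X, eps, min_pts):
    """Minimal DBSCAN sketch: returns one cluster label per point, -1 for noise."""
    n = len(X)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbors = [np.flatnonzero(dist[i] <= eps) for i in range(n)]
    labels = np.full(n, NOISE)
    visited = np.zeros(n, dtype=bool)
    cluster = 0
    for p in range(n):
        if visited[p]:
            continue
        visited[p] = True
        if len(neighbors[p]) < min_pts:
            continue  # stays noise unless a later cluster reaches it as a border point
        labels[p] = cluster
        queue = list(neighbors[p])
        while queue:
            q = queue.pop()
            if not visited[q]:
                visited[q] = True
                if len(neighbors[q]) >= min_pts:
                    queue.extend(neighbors[q])  # q is a core point: keep expanding
            if labels[q] == NOISE:
                labels[q] = cluster             # core or border point joins the cluster
        cluster += 1
    return labels


# Toy data (hypothetical): two dense 3x3 grids and one isolated point.
grid = np.array([[x, y] for x in (0.0, 0.5, 1.0) for y in (0.0, 0.5, 1.0)])
X_toy = np.vstack([grid, grid + 10.0, [[50.0, 50.0]]])
labels = dbscan(X_toy, eps=1.0, min_pts=4)
print(labels)
```

The isolated point never reaches `min_pts` neighbors and is not within `eps` of any core point, so it keeps the noise label.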
5. External Validation: Cluster Purity Test
The Cluster Purity Test is an external validation criterion that measures how well the clusters produced by two different algorithms agree. It is given by: $$ \text{Purity} = \frac{\sum_{j=1}^{k_d} P_j}{\sum_{j=1}^{k_d} M_j}. $$
Where,
$$
N_i = \sum_{j=1}^{k_d} m_{ij} \\
M_j = \sum_{i=1}^{k_t} m_{ij} \\
P_j = \max_i m_{ij}
$$
$m_{ij}$ is the number of data points that are mapped from cluster $i$ to cluster $j$
$N_i$ and $M_j$ are the numbers of data points in cluster $i$ and cluster $j$, respectively
$P_j$ is the size of the dominant class in cluster $j$
$k_t$ and $k_d$ are the numbers of clusters produced by the first and second algorithm, respectively
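The purity formula can be sketched directly from the contingency counts $m_{ij}$. The function name `purity` and the two label vectors are made up for the example; `-1` stands for points that a density-based method marked as noise, which are treated here as their own group.

```python
import numpy as np


def purity(labels_a, labels_b):
    """Purity of clustering B measured against clustering A."""
    labels_a = np.asarray(labels_a)
    labels_b = np.asarray(labels_b)
    a_vals, a_idx = np.unique(labels_a, return_inverse=True)
    b_vals, b_idx = np.unique(labels_b, return_inverse=True)

    # m[i, j] = number of points in cluster i of A that fall in cluster j of B.
    m = np.zeros((len(a_vals), len(b_vals)), dtype=int)
    np.add.at(m, (a_idx, b_idx), 1)

    P = m.max(axis=0)  # size of the dominant class in each cluster j
    M = m.sum(axis=0)  # size of each cluster j
    return P.sum() / M.sum()


# Hypothetical label vectors for eight points.
labels_first = [0, 0, 0, 1, 1, 1, 1, 0]
labels_second = [0, 0, 0, 0, 1, 1, 1, -1]
score = purity(labels_first, labels_second)
print(f"Purity: {score:.2%}")
```

In this toy case seven of the eight points sit in the dominant class of their cluster, giving a purity of 87.5%.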
IV. Results and Discussion
I. Representative Clustering
Choosing the optimal $k$ for K-Medoids Clustering
medoids_plot_internal
Interpretation
1. The SSD (sum of squared distances) decreases consistently as $k$ grows. This is expected, since adding clusters always improves the fit, so on its own it cannot identify an optimal $k$ and chasing it may lead to overfitting.
2. The CH (Calinski-Harabasz) index stays flat at 0 for every $k$, so it offers no guidance here.
3. The SC (Silhouette coefficient) doesn't show a strong peak, which typically would indicate an optimal number of clusters. In this case, the $k$ whose coefficient is closest to 0.5 is 3.
4. The DB (Davies-Bouldin) index, which should be minimized, is lowest at $k=3$, which aligns well with SC.
5. The Gap statistic compares the within-cluster dispersion with that expected under a null reference distribution of the data. The preferred $k$ is the one that produces the biggest change. In this case, it would be 4, followed by 3.
The candidates are therefore $k=3$ and $k=4$. For parsimony's sake, we selected 3.
K-Medoids Scatter Plot
medscatter
medoids_3d
Centroids of each cluster in K-Medoids
We're using this code to figure out which countries stand at the center of each cluster we've created with our K-Medoids model. By calculating the distances between each country's data point (after we've transformed them with PCA) and the center points of the clusters (the centroids), we can pinpoint exactly which countries are the closest to these central spots. These countries are our centroids.
Once we have these indices, they tell us which countries are the most representative of their respective clusters. It implies that these countries' educational indicators, those features we've analyzed, are central to the characteristics that define each cluster. Essentially, we're identifying the countries that best embody the common traits of their group, which helps us understand the unique educational profiles that exist across the globe.
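The lookup described above can be sketched as a nearest-point search. The arrays, the country codes, and the function name `closest_points` are hypothetical; in the notebook the inputs would be the PCA-transformed features and the cluster centers of the fitted model.

```python
import numpy as np


def closest_points(X, centers):
    """Return, for each center, the row index of the closest point in X."""
    # d[i, c] = Euclidean distance from point i to center c.
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return d.argmin(axis=0)


# Hypothetical PCA-space coordinates and placeholder country codes.
X_toy = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [9.0, 9.0]])
codes = np.array(["AAA", "BBB", "CCC", "DDD"])
centers = np.array([[0.9, 1.2], [8.0, 8.0]])

idx = closest_points(X_toy, centers)
print(idx, codes[idx])
```

With K-Medoids the centers are themselves data points, so the minimum distance is zero and the returned indices identify the medoid countries exactly.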
selected_rows
| SE.COM.DURS | SE.ENR.PRIM.FM.ZS | SE.ENR.SECO.FM.ZS | SE.PRM.AGES | SE.PRM.DURS | SE.PRM.ENRL.FE.ZS | SE.PRM.ENRR | SE.PRM.ENRR.FE | SE.PRM.ENRR.MA | SE.PRM.GINT.FE.ZS | ... | SE.PRM.OENR.ZS | SE.PRM.PRIV.ZS | SE.PRM.REPT.ZS | SE.SEC.AGES | SE.SEC.DURS | SE.SEC.ENRL.FE.ZS | SE.SEC.ENRL.GC.FE.ZS | SE.SEC.ENRR | SE.SEC.ENRR.FE | SE.SEC.ENRR.MA | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 63 | 9.0 | 0.94839 | 0.86277 | 6.0 | 6.0 | 47.41941 | 110.742477 | 107.737221 | 113.600243 | 104.13425 | ... | 13.22256 | 11.76751 | 8.29433 | 12.0 | 6.0 | 45.28152 | 45.70542 | 66.517967 | 61.522942 | 71.309036 |
| 46 | 13.0 | 0.99531 | 0.98200 | 7.0 | 4.0 | 48.42716 | 100.015503 | 99.773361 | 100.243927 | 97.84767 | ... | 3.07440 | 9.19171 | 1.78200 | 11.0 | 8.0 | 48.25716 | 50.23067 | 96.799271 | 95.897491 | 97.655731 |
| 39 | 6.0 | 0.85185 | 0.62778 | 7.0 | 6.0 | 45.60272 | 87.285530 | 80.251091 | 94.208328 | 89.27442 | ... | 11.35905 | 28.89716 | 12.66202 | 13.0 | 7.0 | 38.23318 | 37.86940 | 37.603031 | 28.957790 | 46.127239 |
3 rows × 25 columns
centroids_plot()
Given that we're looking at 2011 education data, we look at some indicators that clearly distinguish each centroid country from the others with regard to their educational systems and situations. These indicators give us a snapshot of each country's approach to education, and to research them further, we dive into the broader societal and cultural contexts.
Morocco (MAR):
| Indicator Code | Indicator Name | Description | References |
|---|---|---|---|
| SE.PRM.ENRR | School enrollment, primary, % gross | Morocco's high rate (over 110%) indicates efforts to enroll all eligible children, including those outside the typical age range, possibly due to late starts or re-enrollment. However, this rate does not necessarily translate to completion rates or quality education. | Scholaro |
| SE.PRM.GINT.FE.ZS | Gross intake ratio in first grade of primary education, female, % of relevant age group | The figure surpasses 104, suggesting a commendable effort toward gender equality at the entry level of education. | World Bank |
| SE.PRM.NENR | School enrollment, primary, % net | A net enrollment rate around 93% signifies a strong drive to ensure children of official primary school age are attending school, although issues of quality can undermine the benefits of this high enrollment. | Broken Chalk |
| SE.PRM.REPT.ZS | Repeaters, primary, % of total enrollment | At about 8.3%, repetition is lower than Guinea's (over 12%) but well above Hungary's (about 1.8%), suggesting that most students progress through grades as expected, though less efficiently than in Hungary. | - |
Hungary (HUN):
| Indicator Code | Indicator Name | Description | References |
|---|---|---|---|
| SE.SEC.ENRL.GC.FE.ZS | School enrollment, secondary, general, % female | Girls make up about 50% of general secondary pupils, and gross secondary enrollment is nearly 97%, indicating a high level of gender parity and participation. However, socio-economic status significantly impacts educational participation. | OECD |
| SE.PRM.PRIV.ZS | School enrollment, primary, private % of total primary | With a low value around 9%, Hungary's reliance on public education over private indicates a strong state educational system. | European Commission |
| SE.PRM.REPT.ZS | Repeaters, primary, % of total enrollment | The low repetition rate around 1.8% in primary education suggests efficiency within the educational pathway. | - |
| SE.SEC.DURS | Secondary education, duration, years | The longer duration of secondary education in Hungary reflects its in-depth educational approach. | - |
Guinea (GIN):
| Indicator Code | Indicator Name | Description | References |
|---|---|---|---|
| SE.SEC.ENRR | School enrollment, secondary, % gross | At about 37%, Guinea's rate is significantly lower than that of Hungary and Morocco, reflecting challenges in progressing from primary to secondary education. This aligns with insights that highlight Guinea's struggles with literacy rates and educational access, particularly for girls. | Broken Chalk |
| SE.PRM.OENR.ZS | Over-age students, primary, % of enrollment | The high percentage (over 11%) indicates that a significant number of students are older than the typical age for their grade level, which may imply interruptions in educational progression and systemic issues within the education sector. | - |
| SE.PRM.REPT.ZS | Repeaters, primary, % of total enrollment | The high value (over 12%) suggests inefficiencies, with many students needing to repeat grades, possibly due to quality of education or socio-economic factors. | - |
| SE.PRM.PRIV.ZS | School enrollment, primary, private % of total primary | A higher reliance on private primary education could reflect gaps in the public education system. | - |
Range Plot for different features in K-Medoids
med_range
| Indicator Code | Indicator Name | Analysis for Cluster 0 | Analysis for Cluster 1 | Analysis for Cluster 2 |
|---|---|---|---|---|
| SE.COM.DURS | Compulsory education duration | Ranges from 6 to 16 years, averaging nearly 10 years | Requires about 13 years | Centers around 6 years |
| SE.ENR.PRIM.FM.ZS | School enrollment, primary (gross), gender parity index (GPI) | High, close to parity | High, close to parity | Lowest, indicating barriers to female education access |
| SE.ENR.SECO.FM.ZS | School enrollment, secondary (gross), gender parity index (GPI) | Higher ratio of girls to boys enrolled | Higher ratio of girls to boys enrolled | Lower and more variable |
| SE.PRM.AGES | Primary school starting age | Approximately age 6 | Approximately age 6 | Somewhat younger onset |
| SE.PRM.DURS | Primary education duration | 6-year average | Longer than the 6-year average | 6-year average |
| SE.PRM.ENRL.FE.ZS | Primary education, pupils % female | High percentage of female students | High percentage of female students | Lower mean, suggesting gender disparities |
| SE.PRM.ENRR | School enrollment, primary % gross | Reaching 100%, indicating over-enrollment | High enrollment rates | Lower mean, indicating under-enrollment |
| SE.PRM.ENRR.FE | School enrollment, primary, female % gross | Significant female enrollment rates | Significant female enrollment rates | More variable and usually lower rate |
| SE.PRM.GINT.FE.ZS | Gross intake ratio in first grade of primary education, female | High intake ratio, nearly universal coverage | Over-enrollment of females | Lower mean, indicating enrollment issues |
| SE.PRM.NENR | School enrollment, primary % net | Almost universal net enrollment | Lower average | The lowest, suggesting high out-of-school youth |
| SE.PRM.OENR.ZS | Over-age students, primary % of enrollment | Lower average of over-aged pupils | Greater rates, showing age-grade disparities | Greater rates, showing age-grade disparities |
| SE.PRM.PRIV.ZS | School enrollment, primary, private % of total primary | - | - | - |
| SE.PRM.REPT.ZS | Repeaters, primary % of total enrollment | Low repeater rate, effective grade advancement | Greater rates, more children repeating grades | Greater rates, more children repeating grades |
| SE.SEC.ENRL.FE.ZS | Secondary education, general pupils % female | - | Highest average secondary enrollment rate for females | - |
| SE.SEC.ENRR | School enrollment, secondary % gross | High average rate, successful primary to secondary transition | Lower average, mirrors Hungary's system | Lowest mean, indicating drop-off rates post-primary education |
In conclusion, Cluster 1, represented by Hungary, frequently exhibits the highest average values across the educational measures, pointing to a robust and comprehensive educational framework. Cluster 0, represented by Morocco, has strong enrollment numbers but struggles to maintain quality and reduce dropout rates. Cluster 2, represented by Guinea, presents the greatest difficulties, with low enrollment rates, high repetition rates, and many over-age pupils, which emphasizes the need for focused educational reforms. Clusters 0 and 1 share strengths in enrollment, but Cluster 1's higher results highlight the contrast in educational quality and system efficiency between Morocco and Hungary. Metrics highlighting Guinea's urgent educational needs set Cluster 2 apart.
II. Hierarchical Clustering
Choosing the right threshold based on the dendrogram of Ward's Method
dendro_original
dendrogram_levelled
Looking at the dendrogram, it suggests two main groups because the distance jumps sharply when going from two clusters to one. This big jump often means that forcing these two clusters together would be a bad fit: they're just too different, which could imply that one cluster might be quite different or an 'outlier' compared to the other. Another way to look at this is that these 2 clusters clearly separate countries by how different their early education status is.
We might be looking at countries with unique educational systems or challenges that set them apart from the rest. That's pretty crucial to know because our goal is to identify different educational profiles for better policy and investment decisions. The clusters help us tailor our recommendations for each group based on their shared traits.
Since the largest gap between successive merges lies between a distance of about 310 and just above 500, we chose our threshold between 310 and 500.
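Choosing a cut inside the largest gap of the dendrogram can be sketched with SciPy as follows. The data are synthetic, so the merge heights differ from the 310 to 500 range seen in the report; the rule of cutting at the midpoint of the largest gap is an assumption made for the example.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Synthetic stand-in (hypothetical): two well-separated groups of 20 points.
rng = np.random.default_rng(0)
X_toy = np.vstack([
    rng.normal(0.0, 1.0, size=(20, 2)),
    rng.normal(15.0, 1.0, size=(20, 2)),
])

# Column 2 of the linkage matrix holds the merge distances in increasing order.
Z = linkage(X_toy, method="ward")
heights = Z[:, 2]

# Locate the largest jump between successive merges and cut in the middle of it.
gaps = np.diff(heights)
i = gaps.argmax()
threshold = (heights[i] + heights[i + 1]) / 2

labels = fcluster(Z, t=threshold, criterion="distance")
print("threshold:", round(threshold, 2), "clusters:", len(set(labels)))
```

Any threshold inside the same gap yields the same flat clustering, which is why a range of values rather than a single number can be quoted for the cut.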
ward
comp_3d
The scatter plot further shows how well the clustering worked. The separation is clearly visible, but there are still overlaps. Additionally, the violet cluster is not compact while the yellow cluster is, which strengthens our hypothesis that one cluster collects outliers rather than forming a real cluster. The result shows both balance and parsimony, since we used only 2 clusters and both have an almost equal number of points.
Range Plot for different features in Ward's Method
ward_range
| Indicator Code | Indicator Name | Analysis for Cluster 0 | Analysis for Cluster 1 |
|---|---|---|---|
| SE.COM.DURS | Compulsory education duration | Broader range peaking at 16 years | Shorter and more consistent duration |
| SE.ENR.PRIM.FM.ZS | School enrollment, primary (gross), gender parity index (GPI) | Strong start for female primary education | Slightly higher average parity index |
| SE.ENR.SECO.FM.ZS | School enrollment, secondary (gross), gender parity index (GPI) | Greater variability | More uniform ratio |
| SE.PRM.AGES | Primary school starting age | Wider range of starting ages | Average start around age 6 |
| SE.PRM.DURS | Primary education duration | Averages a longer duration | - |
| SE.PRM.ENRL.FE.ZS | Primary education, pupils % female | Higher peak of female pupils | Similar average percentage |
| SE.PRM.ENRR | School enrollment, primary % gross | - | Higher average gross primary enrollment |
| SE.PRM.ENRR.FE | School enrollment, primary, female % gross | Wider spread and higher maximum value | Averages a higher rate of female enrollment |
| SE.PRM.GINT.FE.ZS | Gross intake ratio in first grade of primary education, female | Larger range | Consistently higher average intake ratio |
| SE.PRM.NENR | School enrollment, primary % net | - | Notably higher average net enrollment |
| SE.PRM.OENR.ZS | Over-age students, primary % of enrollment | More significant challenges with age-grade distortion | Significantly lower average |
| SE.PRM.PRIV.ZS | School enrollment, primary, private % of total primary | Higher reliance on private education | - |
| SE.PRM.REPT.ZS | Repeaters, primary % of total enrollment | Higher and more variable repeater rate | Fewer students repeating grades |
| SE.SEC.ENRL.FE.ZS | Secondary education, general pupils % female | - | Higher average enrollment rate for females |
| SE.SEC.ENRR | School enrollment, secondary % gross | - | Higher average gross enrollment rate |
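The comparisons in the table rest on each cluster's minimum, average, and maximum per indicator. A minimal sketch of that summary is shown below, assuming a data frame with one row per country and a `cluster` column; the values are hypothetical toy numbers, not the actual WDI figures, and the column layout is an assumption for illustration.

```python
import pandas as pd

# Hypothetical toy values for two indicators (NOT the actual WDI figures).
df = pd.DataFrame({
    "cluster": [0, 0, 0, 1, 1, 1],
    "SE.PRM.REPT.ZS": [12.0, 20.0, 5.0, 1.0, 2.0, 3.0],
    "SE.PRM.NENR": [60.0, 85.0, 70.0, 95.0, 97.0, 93.0],
})

# Min, mean, and max of every indicator within each cluster.
summary = df.groupby("cluster").agg(["min", "mean", "max"])

# Range (max minus min) per indicator and cluster; a wider range
# corresponds to a longer bar in a range plot.
spread = summary.xs("max", axis=1, level=1) - summary.xs("min", axis=1, level=1)

print(summary)
print(spread)
```

In this toy data, cluster 0 has both the wider range and the higher mean for the repeater rate, which is the kind of pattern the table describes in words.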
Cluster 1, encompassing countries like Argentina, Albania, Armenia, Austria, Turkey, Ukraine, Uzbekistan, and Venezuela, is characterized by consistent and higher educational rates. This cluster demonstrates shorter, more uniform compulsory education durations and a higher intake of females in both primary and secondary education, indicating a focus on gender parity. Additionally, higher net primary enrollment rates, fewer over-aged students, and a lower rate of repeaters in these countries suggest efficient educational progress, despite the diverse challenges specific to each nation's context, such as rural access or political influences.
Contrastingly, Cluster 0 includes countries like Burundi, Burkina Faso, Dominican Republic, Djibouti, Bhutan, Ethiopia, Ghana, India, Morocco, and Laos, exhibiting a broader range in educational metrics. This cluster's diverse educational systems are reflected in longer durations of primary education and varied starting ages, indicative of different national policies. The higher reliance on private schooling and variable repeater rates within these countries point to challenges in the public education system, particularly in developing nations where educational quality and access remain key issues.
In summary, while Cluster 1 represents a more uniform, efficient educational system, indicating stronger policy implementations and emphasis on gender parity, Cluster 0 reveals a landscape of diverse educational challenges and approaches. Each cluster, despite its overarching characteristics, comprises countries with unique educational strengths and weaknesses. This distinction underscores the need for nuanced understanding and targeted educational strategies, recognizing the potential models in Cluster 1 and addressing specific needs highlighted by the diversity in Cluster 0.
III. Density-Based Clustering
dbscatter
db_3d
Hierarchical clustering with Ward's method identifies two clusters. In contrast, density-based clustering suggests a single prominent cluster, with the remaining data points classified as outliers. Notably, many of these outliers belong to a cluster in the hierarchical result: the violet cluster in hierarchical clustering resembles the outliers of density-based clustering.
This discrepancy highlights the distinct methodologies of the clustering techniques: hierarchical clustering's approach tends to group data based on global patterns, while density-based clustering focuses on dense regions of data points.
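The contrast between the two approaches can be reproduced on a small synthetic example. The sketch below uses made-up points, not the WDI data, and the `eps` and `min_samples` values are assumptions chosen for this toy layout: a tight grid of points plus five widely spaced points.

```python
import numpy as np
from sklearn.cluster import DBSCAN, AgglomerativeClustering

# Synthetic stand-in data: a tight 5x5 grid plus five widely spaced points.
gx, gy = np.meshgrid(np.arange(5) * 0.1, np.arange(5) * 0.1)
dense = np.column_stack([gx.ravel(), gy.ravel()])
sparse = np.array([[10.0, 10.0], [13.0, 10.0], [10.0, 13.0],
                   [13.0, 13.0], [11.5, 16.0]])
X = np.vstack([dense, sparse])

# Ward linkage is asked for two clusters and must assign every point.
ward_labels = AgglomerativeClustering(n_clusters=2, linkage="ward").fit_predict(X)

# DBSCAN labels points without enough close neighbors as noise (-1).
db_labels = DBSCAN(eps=0.5, min_samples=4).fit_predict(X)

noise = db_labels == -1
ward_sparse = ward_labels == ward_labels[-1]  # Ward cluster of the last sparse point

print("DBSCAN labels found:", sorted(set(db_labels.tolist())))
print("DBSCAN noise points:", int(noise.sum()))
print("Ward cluster equals DBSCAN noise set:", bool(np.array_equal(ward_sparse, noise)))
```

Here the sparse points are too far apart to meet DBSCAN's density requirement, so they become noise, while Ward groups them into a second cluster because merging them with each other increases within-cluster variance far less than merging them with the dense group.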
Looking at the scatter plots for both Hierarchical and Density-Based Clustering, we can see that they are very similar: the outliers in DBSCAN are treated as a cluster in Hierarchical Clustering. Diving deeper, we deploy a Cluster Purity Test to examine how closely the two results match and explain each other. In this case, we consider the outliers as a cluster of their own.
IV. Cluster Purity Test
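import numpy as np


def purity(y_pred_db, y_pred_ward):
    """Compute the purity of one clustering with respect to another.

    Parameters
    ----------
    y_pred_db : array
        Labels from density-based clustering, used as the reference classes
    y_pred_ward : array
        Cluster labels from Ward's method

    Returns
    -------
    purity : float
        Fraction of points that fall in the dominant reference class of their cluster
    """
    # Rows are reference labels, columns are cluster labels
    matrix = confusion_matrix(y_pred_db, y_pred_ward)
    # Sum the largest count in each column, then divide by the total count
    return np.sum(np.amax(matrix, axis=0)) / np.sum(matrix)


score = purity(y_predict_country_com, cluster_labels)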
The cluster purity test comparing the hierarchical and Density-Based clustering results yielded a notable score of 81.37%. This means that when each cluster from one method is matched to its dominant group in the other, 81.37% of all data points fall in the matched group, indicating a high degree of homogeneity within the clusters. Such a score suggests that both clustering methods have grouped the data points in largely the same way, with a majority of points in each cluster sharing common characteristics or features.
This level of purity is significant, especially considering the complexities and potential irregularities inherent in real-world datasets. It suggests that both hierarchical and Density-Based methods are adept at identifying and grouping similar data points, even in the absence of predefined cluster boundaries or assumptions about the data distribution.
This happened because the Ward Linkage used for Hierarchical Clustering was able to identify the outliers from DBSCAN as a distinct cluster because it focuses on minimizing the increase in within-cluster variance, rather than adhering to a strict density threshold like DBSCAN. This approach allows Ward's method to recognize sparser groups of points as valid clusters, which DBSCAN might label as outliers due to their lower density. Essentially, Ward's Linkage prioritizes overall cluster cohesion over density, enabling it to classify less dense but meaningful groups of data points as clusters.
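To make the purity computation concrete, here is a small worked example with hypothetical labels for ten points (not the notebook's actual labels). DBSCAN noise, labeled -1, is treated as a cluster of its own, as in the test above. The function is restated so the sketch runs on its own.

```python
import numpy as np
from sklearn.metrics import confusion_matrix


def purity(labels_ref, labels_pred):
    # Rows are reference labels, columns are cluster labels.
    matrix = confusion_matrix(labels_ref, labels_pred)
    # Largest count per column, summed, divided by the number of points.
    return np.sum(np.amax(matrix, axis=0)) / np.sum(matrix)


# Hypothetical labels: six points in DBSCAN cluster 0, four marked as noise (-1).
db_labels = np.array([0, 0, 0, 0, 0, 0, -1, -1, -1, -1])
# Hypothetical Ward labels for the same ten points.
ward_labels = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 1])

# Ward cluster 1 holds 5 points from DBSCAN cluster 0 and 1 noise point.
# Ward cluster 0 holds 3 noise points and 1 point from DBSCAN cluster 0.
# Purity = (5 + 3) / 10.
score = purity(db_labels, ward_labels)
print(score)  # 0.8
```

Two points are counted as mismatches because they sit in a Ward cluster dominated by the other DBSCAN group, which is how a score below 100% arises even when the two partitions look alike in a scatter plot.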
V. Conclusion and Recommendation
After analyzing the educational data from 2011, our representative clustering has pinpointed key differences and similarities in education systems worldwide. Morocco and Hungary, although having high enrollment rates, differ in their ability to keep students in school past primary education, especially girls. Guinea is dealing with more basic issues, struggling to even get kids consistently through the primary level. By offering actions aligned with our findings, we can help countries within each group to improve their education systems. This isn't just about hitting enrollment targets but making sure that every child gets a quality education and a real shot at a better future.
As for our analysis of Hierarchical Clustering and DB Clustering, our purity test for the hierarchical and Density-Based clustering methods gave us a solid score of 81.37%. This means that a big majority of the points in each cluster really belong together, showing that both methods did a great job in grouping similar data points. What's interesting is how the hierarchical method, especially with Ward Linkage, picked up on the outliers that DBSCAN found and treated them as a separate cluster. This shows us that combining both methods gives us a fuller picture, making it a smart move to use them together in our analysis for more accurate and detailed results.
Cluster 1, featuring countries like Argentina and Austria, shows consistent and higher educational rates with a focus on gender parity and efficient educational progress. However, its diverse challenges, such as rural education access in Austria, illustrate the unique contexts within the cluster. Cluster 0, whose members are also the outliers in DBSCAN and which includes nations like India and Burkina Faso, displays a broader range in educational metrics, indicative of diverse educational systems and challenges in public education, like access and quality.
While these clusters provide an overview of more developed (Cluster 1) versus developing (Cluster 0) education systems, it's crucial to remember that each country within a cluster has unique educational strengths and weaknesses. This underscores the need for a deeper, more nuanced understanding of each country's specific educational landscape to inform targeted and effective educational strategies.
Implication in the Philippines Context
Considering the educational challenges highlighted in the Philippines, including infrastructure deficiencies, private school closures, and the quality of learning outcomes, it's plausible to suggest that the Philippines might align with the cluster represented by Guinea. This cluster signifies educational systems grappling with fundamental issues that significantly impact the quality and inclusivity of education.
According to iTacloban's report, the Basic Education Report 2023 of the Philippines revealed substantial infrastructural deficits, with a considerable number of school buildings requiring major repairs or being marked for condemnation. Moreover, the lack of facilities and resources has been underscored as a critical concern. These challenges mirror those seen in Guinea's cluster, where systemic issues in infrastructure and resource provision hinder the educational process.
Furthermore, issues with the procurement process, a decline in enrollment in private schools, and concerns over curriculum and employability resonate with the broader challenges identified in Guinea's cluster. The Philippines' education sector is also dealing with the aftermath of prolonged school closures and the need for a significant curriculum overhaul, suggesting a potential match with Guinea's cluster profile.
The current state of the Philippines' education system suggests the need for targeted interventions similar to those recommended for Guinea's cluster. These would include infrastructural investments, curriculum reforms to address 21st-century skills, and strategies to improve the overall quality and accessibility of education.
VI. References
Amador III, J. (2023, February). The Philippines’ Basic Education Crisis. The Diplomat. Retrieved February 11, 2024, from https://thediplomat.com/2023/02/the-philippines-basic-education-crisis/
Broken Chalk. (n.d.). Beyond the Medina: Unpacking Morocco’s Educational Challenges. Broken Chalk. Retrieved February 11, 2024, from https://brokenchalk.org/beyond-the-medina-unpacking-moroccos-educational-challenges/
Broken Chalk. (n.d.). Challenges in Guinea's Education System. Broken Chalk. Retrieved February 11, 2024, from https://brokenchalk.org/challenges-in-guineas-education-system/
ChatGPT. (2024). Assistance with analysis and phrasing for educational systems project. OpenAI. Retrieved February 11, 2024, from https://openai.com/chatgpt
European Commission. (n.d.). Hungary Overview. Eurydice - European Commission. Retrieved February 11, 2024, from https://eurydice.eacea.ec.europa.eu/national-education-systems/hungary/overview
iTacloban. (2023, January). Basic Education Report 2023. Retrieved February 11, 2024, from https://www.itacloban.com/2023/01/basic-education-report-2023.html
OECD. (n.d.). Education at a Glance 2021: OECD Indicators. OECD iLibrary. Retrieved February 11, 2024, from https://www.oecd-ilibrary.org/docserver/9789264273344-6-en.pdf
PBEd. (2023). State of Philippine Education Report 2023. Philippine Business for Education. Retrieved February 11, 2024, from https://pbed.ph/blogs/47/PBEd/State%20of%20Philippine%20Education%20Report%202023
Scholaro. (n.d.). Morocco Education System. Scholaro. Retrieved February 11, 2024, from https://www.scholaro.com/pro/Countries/Morocco/Education-System
Soleymani, A. (n.d.). Beyond scikit-learn: Is it time to retire K-means and use this method instead? Medium. Retrieved February 11, 2024, from https://medium.com/@ali.soleymani.co/beyond-scikit-learn-is-it-time-to-retire-k-means-and-use-this-method-instead-b8eb9ca9079a
The World Bank. (n.d.). Morocco Overview. The World Bank. Retrieved February 11, 2024, from https://www.worldbank.org/en/country/morocco/overview